Data corruption

Data corruption refers to errors in computer data that occur during writing, reading, storage, transmission, or processing, which introduce unintended changes to the original data. Computer storage and transmission systems use a number of measures to provide data integrity, or lack of errors.

In general, when data corruption occurs, the file containing that data may become inaccessible, and the system or the related application reports an error. For example, if a Microsoft Word file is corrupted, attempting to open it in MS Word produces an error message and the file may fail to open. Some programs can offer to repair the file automatically after the error, while others cannot; the outcome depends on the severity of the corruption and on the application's built-in ability to handle the error. The corruption itself can arise from a variety of causes.


Transmission

Data corruption during transmission has a variety of causes. Interruption of data transmission causes information loss. Environmental conditions can interfere with data transmission, especially when dealing with wireless transmission methods. Heavy clouds can block satellite transmissions. Wireless networks are susceptible to interference from devices such as microwave ovens.

Storage

Data loss during storage has two broad causes: hardware and software failure. Background radiation, head crashes, and aging or wear of the storage device fall into the former category, while software failure typically occurs due to bugs in the code.

Error detection and correction may occur in the hardware, in the disk subsystem or adapter, or in software that implements error checking and correction (e.g., RAID software such as mdadm for Linux).

There are two broad types of data corruption: detected corruption, which the system notices and can report or attempt to repair, and undetected ("silent") corruption, which goes unnoticed until the affected data is read or used.

Countermeasures

When data corruption behaves as a Poisson process, where each bit of data has an independently low probability of being changed, data corruption can generally be detected by the use of checksums, and can often be corrected by the use of error correcting codes.
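As a minimal sketch of checksum-based detection, the example below uses Python's standard zlib.crc32; this is only an illustration of the principle, as real storage and transmission systems typically rely on dedicated ECC/CRC hardware or stronger codes.

    import zlib

    # Sketch of checksum-based corruption detection; zlib's CRC-32 stands in
    # for whatever checksum a real system would use.
    original = b"account balance: 1024.00"
    stored_checksum = zlib.crc32(original)

    # Simulate a single flipped bit during storage or transmission.
    corrupted = bytearray(original)
    corrupted[5] ^= 0x01

    if zlib.crc32(bytes(corrupted)) != stored_checksum:
        print("corruption detected: checksum mismatch")
    else:
        print("data verified")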

If an uncorrectable data corruption is detected, procedures such as automatic retransmission or restoration from backups can be applied. Certain RAID levels store and evaluate parity bits for data across a set of hard disks and can reconstruct corrupted data upon the failure of one or more disks, depending on the RAID level implemented.
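The single-parity reconstruction used by such arrays can be illustrated with a short sketch; this is a simplified model of XOR parity, not an actual RAID implementation.

    # Simplified XOR-parity model: one parity block protects against the
    # loss of any single data block (the idea behind RAID 4/5).
    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    data = [b"AAAA", b"BBBB", b"CCCC"]   # three equal-sized "disks"
    parity = xor_blocks(data)

    # Simulate losing the second disk; XOR of the survivors and the parity
    # block rebuilds its contents.
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]
    print("reconstructed block:", rebuilt)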

Today, many errors are detected and corrected by the disk drive itself using the ECC/CRC codes[1] stored on disk for each sector. If the disk drive detects multiple read errors on a sector, it may make a copy of the failing sector on another part of the disk, remapping the failed sector to a spare sector without the involvement of the operating system (though this may be delayed until the next write to the sector).

This "silent correction" can lead to other problems if disk storage is not managed well, as the disk drive will continue to remap sectors until it runs out of spares, at which time the temporary correctable errors can turn into permanent ones as the disk drive deteriorates. S.M.A.R.T. provides a standardized way of monitoring the health of a disk drive, and there are tools available for most operating systems to automatically check the disk drive for impending failures by watching for deteriorating SMART parameters.

"Data scrubbing" is another method to reduce the likelihood of data corruption, as disk errors are caught and recovered from, before multiple errors accumulate and overwhelm the number of parity bits. Instead of parity being checked on each read, the parity is checked during a regular scan of the disk, often done as a low priority background process. Note that the "data scrubbing" operation activates a parity check. If a user simply runs a normal program that reads data from the disk, then the parity would not be checked unless parity-check-on-read was both supported and enabled on the disk subsystem.

If appropriate mechanisms are employed to detect and remedy data corruption, data integrity can be maintained. This is particularly important in commercial applications (e.g., banking), where an undetected error could corrupt a database index or change data in a way that drastically affects an account balance, and in the use of encrypted or compressed data, where a small error can make an extensive dataset unusable.[2] Although the CERN study is often cited as showing large levels of data corruption, the disk subsystem it examined was set up with RAID 5 and a single parity bit (and hence could not recover from a single "silent" error), did not use parity-check-on-read (and hence could not detect "silent" errors through parity checking of the RAID stripe), and did not use data scrubbing. The disk storage was also subject to a bug in the disk firmware (microcode) which caused higher levels of errors than normal.[3]
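The fragility of compressed data mentioned above can be demonstrated with a small sketch: flipping a single byte in a zlib-compressed stream typically makes the whole stream undecodable.

    import zlib

    payload = b"transaction log entry\n" * 1000
    compressed = bytearray(zlib.compress(payload))

    compressed[10] ^= 0xFF  # corrupt one byte in the middle of the stream

    try:
        zlib.decompress(bytes(compressed))
    except zlib.error as exc:
        print("entire dataset unreadable after a one-byte error:", exc)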

References

  1. ^ "Read Error Severities and Error Management Logic". http://www.storagereview.com/guide/errorRead.html. Retrieved 24 July 2011. 
  2. ^ Data Integrity by Cern April 2007 Cern.ch
  3. ^ Bernd Panzer-Steindel. "Data integrity". http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797. There are some correlations with known problems, like the problem where disks drop out of the RAID5 system on the 3ware controllers. After some long discussions with 3Ware and our hardware vendors this was identified as a problem in the WD disk firmware.